Repository: https://github.com/Ruhguevara/challenge_cdc
Drive folder: https://drive.google.com/drive/folders/1HN0uV-YV8Yyo4FJ04ex3OoqIbAVmua0t?usp=sharing
Problem:
The company needs a fraud model. Your role as a Data Scientist is to deliver a model-based solution and mitigate fraud risk using data. The data for building the model is datos.csv. Suggested steps to consider in your fraud model:
Deliverables:
Remember that in a team context, the people who read your code may not have been involved in its development but will still have to understand and/or maintain it. You should also comment each section and share your findings.
Data:
I. Exploratory Data Analysis and Cleaning
II. Statistics
III. Pre-processing
IV. Models
V. Conclusions
It is suggested to create a virtual environment with:
python -m venv venv -> source venv/Scripts/activate
After activating the virtual environment, install the requirements:
pip install -r requirements.txt
Run Jupyter Notebook:
jupyter notebook
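Based on the imports used below, requirements.txt presumably contains at least the following packages (the exact pins are not listed here, so this is only a sketch):
pandas>=2.0
pyarrow
numpy
matplotlib
seaborn
plotly
scikit-learn
imbalanced-learn
xgboost
notebook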
# Data manipulation and analysis
import pandas as pd
import numpy as np
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display_html
# Machine Learning
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, learning_curve
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report, roc_curve, confusion_matrix, make_scorer, roc_auc_score
# Additional libraries
from collections import Counter
from EDA import DataExplorer as de # Script written by me
import time
# Settings
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
matplotlib.style.use('seaborn')
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
start_time = time.time()
Using Parquet as the storage format reduces file size, processing time, and the costs associated with both, which translates into money, time, and storage saved.
Sources:
Documentation
Overview
Pandas 2.0 introduces the Apache Arrow backend, which makes the way data is stored in memory more efficient.
For example:
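A minimal sketch of that workflow: the raw CSV is converted to Parquet once and then read back with Arrow-backed dtypes. The CSV path is an assumption based on the problem statement; the Parquet path is the one used later in this notebook.
# One-off conversion from CSV to Parquet
raw = pd.read_csv('datos.csv')  # hypothetical path for the raw file
raw.to_parquet('data/datos_fraude.parquet', engine='pyarrow', index=False)
# Reading it back with dtype_backend='pyarrow' keeps the columns in Arrow memory
df_arrow = pd.read_parquet('data/datos_fraude.parquet', engine='pyarrow', dtype_backend='pyarrow')
print(df_arrow.dtypes.head())  # e.g. string[pyarrow], double[pyarrow]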
For the first part of the project I will work with the pyarrow backend; since this is the most recent pandas version, there are some compatibility issues when it comes to modeling.
Another alternative could be PySpark.
Examples of the read-speed improvement with Parquet and pyarrow:
The code was executed n times, with 10 loops each; the average time was X ms (milliseconds), with a standard deviation of ± X ms.
%%time
# Load data
df = pd.read_parquet('data/datos_fraude.parquet', engine='pyarrow', dtype_backend='pyarrow')
CPU times: total: 188 ms Wall time: 130 ms
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 284807 entries, 0 to 284806 Data columns (total 36 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 transaction_id 284807 non-null string[pyarrow] 1 timestamp 284807 non-null double[pyarrow] 2 amount 284807 non-null double[pyarrow] 3 variable_01 284807 non-null double[pyarrow] 4 variable_02 284807 non-null double[pyarrow] 5 variable_03 284807 non-null double[pyarrow] 6 variable_04 284807 non-null double[pyarrow] 7 variable_05 284807 non-null double[pyarrow] 8 variable_06 284807 non-null double[pyarrow] 9 variable_07 284807 non-null double[pyarrow] 10 variable_08 284807 non-null double[pyarrow] 11 variable_09 284807 non-null double[pyarrow] 12 variable_10 284807 non-null double[pyarrow] 13 variable_11 284807 non-null double[pyarrow] 14 variable_12 284807 non-null double[pyarrow] 15 variable_13 284807 non-null double[pyarrow] 16 variable_14 284807 non-null double[pyarrow] 17 variable_15 284807 non-null double[pyarrow] 18 variable_16 284807 non-null double[pyarrow] 19 variable_17 284807 non-null double[pyarrow] 20 variable_18 284807 non-null double[pyarrow] 21 variable_19 284807 non-null double[pyarrow] 22 variable_20 284807 non-null double[pyarrow] 23 variable_21 284807 non-null double[pyarrow] 24 variable_22 284807 non-null double[pyarrow] 25 variable_23 284807 non-null double[pyarrow] 26 variable_24 284807 non-null double[pyarrow] 27 variable_25 284807 non-null double[pyarrow] 28 variable_26 284807 non-null double[pyarrow] 29 variable_27 284807 non-null double[pyarrow] 30 variable_28 284807 non-null double[pyarrow] 31 variable_29 284807 non-null double[pyarrow] 32 variable_30 284807 non-null double[pyarrow] 33 variable_31 284807 non-null double[pyarrow] 34 variable_32 284807 non-null double[pyarrow] 35 is_fraud 284807 non-null int64[pyarrow] dtypes: double[pyarrow](34), int64[pyarrow](1), string[pyarrow](1) memory usage: 95.7 MB
display_html(df.head(2), df.sample(2), df.tail(2))
| transaction_id | timestamp | amount | variable_01 | variable_02 | variable_03 | variable_04 | variable_05 | variable_06 | variable_07 | variable_08 | variable_09 | variable_10 | variable_11 | variable_12 | variable_13 | variable_14 | variable_15 | variable_16 | variable_17 | variable_18 | variable_19 | variable_20 | variable_21 | variable_22 | variable_23 | variable_24 | variable_25 | variable_26 | variable_27 | variable_28 | variable_29 | variable_30 | variable_31 | variable_32 | is_fraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 99899e9e02c4b41fc442744220e6fa12f8d36b26f70027... | 155270.0 | 12.00 | -0.071330 | -0.032900 | 0.109989 | 0.339889 | -0.626131 | -0.116853 | 1.220826 | 0.394041 | 0.051705 | 0.700084 | -1.185263 | -0.183050 | 1.051029 | 0.267423 | -0.220569 | 1.358207 | -0.321922 | -1.121246 | 0.852400 | -0.635935 | -0.445327 | -0.251412 | -0.989219 | -0.168169 | -1.054944 | -1.603176 | -0.616640 | 2.283078 | 0.373964 | 1.576543 | -0.941557 | -0.10528 | 0 |
| 1 | d678605da2ed45d14c95228b2e6a0daa1c635c7c5d3f7c... | 46054.0 | 208.89 | 0.057858 | 0.003669 | 0.076745 | 0.392782 | 0.458835 | -0.279094 | 0.435257 | 0.230350 | 0.292049 | 0.624199 | 0.327219 | -0.032708 | -0.372193 | 1.440826 | -0.049183 | -0.595081 | 0.376050 | -0.518996 | -1.245331 | 2.034140 | -0.073293 | -0.439827 | -0.863551 | -1.606923 | -1.100937 | 0.597263 | -1.228029 | 0.875399 | 0.260934 | -0.558290 | 0.763729 | 0.01174 | 0 |
| transaction_id | timestamp | amount | variable_01 | variable_02 | variable_03 | variable_04 | variable_05 | variable_06 | variable_07 | variable_08 | variable_09 | variable_10 | variable_11 | variable_12 | variable_13 | variable_14 | variable_15 | variable_16 | variable_17 | variable_18 | variable_19 | variable_20 | variable_21 | variable_22 | variable_23 | variable_24 | variable_25 | variable_26 | variable_27 | variable_28 | variable_29 | variable_30 | variable_31 | variable_32 | is_fraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 86648 | c1b59f74b76b9a2091e4d9668a0eb213d7fca895f1e605... | 172476.0 | 79.00 | 0.106614 | 0.280808 | -0.041085 | 0.591310 | -0.96821 | -0.099984 | 0.211604 | -0.028444 | 0.058716 | -1.316281 | -0.518383 | 0.503358 | -0.753112 | -0.535531 | -0.577345 | 0.827327 | -0.253848 | 0.561585 | 0.579010 | -2.297606 | 0.650367 | -0.596290 | 1.658007 | 0.676568 | -3.370574 | 1.176248 | -1.094176 | -1.175035 | -0.139689 | -1.129669 | 1.407305 | 0.898586 | 0 |
| 80544 | 7f41c3a3da7258410480bdb43eebdf79c348fd397762a9... | 59338.0 | 139.72 | 0.962821 | -2.579400 | 0.425745 | -0.409096 | -0.17863 | 0.632078 | 0.285500 | -0.728802 | -2.323487 | -1.224457 | -1.242226 | -0.528219 | 0.271861 | -1.186460 | -1.246944 | -0.620173 | 0.818335 | 1.263754 | 0.518457 | 2.026752 | -0.316200 | 0.968524 | 0.690563 | -0.361060 | -0.934445 | 0.633143 | -2.452645 | -5.229390 | 1.447533 | 0.407791 | 12.709237 | -8.254080 | 0 |
| transaction_id | timestamp | amount | variable_01 | variable_02 | variable_03 | variable_04 | variable_05 | variable_06 | variable_07 | variable_08 | variable_09 | variable_10 | variable_11 | variable_12 | variable_13 | variable_14 | variable_15 | variable_16 | variable_17 | variable_18 | variable_19 | variable_20 | variable_21 | variable_22 | variable_23 | variable_24 | variable_25 | variable_26 | variable_27 | variable_28 | variable_29 | variable_30 | variable_31 | variable_32 | is_fraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 284805 | 0b243a55869df946b3b78f84ec5554da161f89a1181c12... | 167436.0 | 256.58 | 0.093611 | 0.051865 | -0.410531 | 0.685550 | -0.965304 | 0.269152 | 0.990574 | 0.476423 | 0.581536 | -1.557949 | -0.044761 | -0.328253 | -0.077867 | -0.994052 | -0.818322 | 0.265405 | 0.990591 | -0.001446 | -1.343083 | 0.590212 | 1.086074 | -0.264151 | 2.972262 | 0.807602 | -0.016152 | 2.330076 | -1.272419 | -2.035372 | -1.395804 | -0.116800 | 1.235671 | 0.165967 | 0 |
| 284806 | 42b724f0d8f3ebe8b3088bccd5edb3b8fc1870af1de24c... | 150148.0 | 4.26 | 0.220985 | 0.234333 | 0.542929 | -1.255929 | -0.994163 | 0.023618 | -0.219682 | -0.144960 | 0.251301 | 3.495537 | 0.440162 | 0.014331 | -0.742116 | 0.520878 | 0.480816 | 0.545250 | 0.433750 | 0.104661 | 0.347982 | -0.763714 | -0.064661 | 0.552298 | 0.247960 | 0.724035 | 0.744822 | -0.005119 | 0.804242 | 0.447816 | 1.845957 | -1.113174 | 2.916996 | 0.749865 | 0 |
df_shape_1 = df.shape
print(df_shape_1)
(284807, 36)
# Class
class_val = Counter(df['is_fraud'])
class_val
Counter({0: 284315, 1: 492})
There is a severe class imbalance, which will be a considerable obstacle when modeling. In the real world, the first thing I would do is obtain a larger data sample; even if the proportion stays the same, what matters is that the minority class grows in a more organic way.
As for applying subsampling, it implies a very considerable loss of information, although a point in its favor is the reduction in processing time and, with it, the associated computational and monetary costs. Oversampling, on the other hand, implies creating synthetic samples, increases processing time, and introduces noise.
In the end, there is an opportunity cost in how the class imbalance is handled. I do not think it is worth working only with reduced data (subsampling) under the excuse of saving computational and/or economic resources; there are many other ways to save those resources that do not compromise the model's effectiveness. The most important thing is to have an effective model, since there are gains or losses directly tied to it. This is an iterative process in which the results of past models feed future models, that is, an investment.
%%time
df[["amount", 'is_fraud']].sort_values(by = "amount", ascending = False).head(10).style.bar(subset=["amount"], color='lightgreen')
CPU times: total: 31.2 ms Wall time: 104 ms
| amount | is_fraud | |
|---|---|---|
| 162976 | 25691.160000 | 0 |
| 63495 | 19656.530000 | 0 |
| 248934 | 18910.000000 | 0 |
| 100173 | 12910.930000 | 0 |
| 59354 | 11898.090000 | 0 |
| 265206 | 11789.840000 | 0 |
| 232178 | 10199.440000 | 0 |
| 268189 | 10000.000000 | 0 |
| 143634 | 8790.260000 | 0 |
| 111365 | 8787.000000 | 0 |
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 6))
sns.set(style="whitegrid")
sns.boxplot(ax=ax1, x="is_fraud", y="amount", hue="is_fraud", data=df, showfliers=True)
sns.boxplot(ax=ax2, x="is_fraud", y="amount", hue="is_fraud", data=df, showfliers=False)
ax1.set_title("Box Plot with Outliers")
ax2.set_title("Box Plot without Outliers")
plt.tight_layout()
plt.show()
The showfliers parameter draws the boxplot with or without the effect of outliers. On the left-hand side, the outliers of the amount variable are concentrated in the non-fraud cases (0), while the fraud cases (1) show no outliers as significant. If the outliers are removed, as in the right-hand plot, the situation is reversed: the fraud cases now show more variable statistics.
Keeping the outliers, in this case for amount, may help to better distinguish each class.
total = df.isnull().sum().sort_values(ascending = False)
percent = (df.isnull().sum()/df.isnull().count()*100).sort_values(ascending = False)
pd.concat([total, percent], axis=1, keys=['Total', '%']).transpose()
| transaction_id | timestamp | variable_18 | variable_19 | variable_20 | variable_21 | variable_22 | variable_23 | variable_24 | variable_25 | variable_26 | variable_27 | variable_28 | variable_29 | variable_30 | variable_31 | variable_32 | variable_17 | variable_16 | variable_15 | variable_06 | amount | variable_01 | variable_02 | variable_03 | variable_04 | variable_05 | variable_07 | variable_14 | variable_08 | variable_09 | variable_10 | variable_11 | variable_12 | variable_13 | is_fraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Total | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| % | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
There do not appear to be any missing values. If there were, there are several imputation techniques: simple statistics (mean/median/mode), distribution-based imputation, and model-based imputation.
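As an illustration only (this dataset has no missing values), a minimal sketch of two of those techniques with scikit-learn, using KNNImputer as the model-based example:
from sklearn.impute import SimpleImputer, KNNImputer
num_cols = [c for c in df.columns if c != 'transaction_id']  # every column except the id is numeric
# Median imputation: simple, but it can distort the original distribution
df_median = df.copy()
df_median[num_cols] = SimpleImputer(strategy='median').fit_transform(df[num_cols])
# Model-based imputation: each missing value is estimated from its k nearest neighbours
df_knn = df.copy()
df_knn[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])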
print("Número de Duplicados: ", df.duplicated().sum())
Número de Duplicados: 1081
# Counter produces an ordered count, which makes repeated values easy to spot
# It has methods such as .most_common(n), which returns the n most common values.
c = Counter(df["transaction_id"])
c.most_common(5)
[('10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d113819315d49a600425e', 18),
('0ee9ea07663c819692ffe399076509ff60095d6d86739f3ccf79c157044e6335', 18),
('613d73b3fb15878d372c710b2d4dfac57615e34b4151cb13c780b120a9498143', 9),
('97d77994bd9aa20e2f53169ecc6e88450a53291b70dbb3dca454bc66e1fdd98c', 9),
('e6c63670531c4bdaef9f79b16d0251bee8d63e05b730e473bb007a2a175261dd', 6)]
# Inspect the transaction_id repeated most often (18 times)
df[df["transaction_id"] == '10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d113819315d49a600425e'] #.loc[:, \ ["transaction_id", "timestamp", "amount", "is_fraud"]]
| transaction_id | timestamp | amount | variable_01 | variable_02 | variable_03 | variable_04 | variable_05 | variable_06 | variable_07 | variable_08 | variable_09 | variable_10 | variable_11 | variable_12 | variable_13 | variable_14 | variable_15 | variable_16 | variable_17 | variable_18 | variable_19 | variable_20 | variable_21 | variable_22 | variable_23 | variable_24 | variable_25 | variable_26 | variable_27 | variable_28 | variable_29 | variable_30 | variable_31 | variable_32 | is_fraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1668 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
| 12505 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
| 21665 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
| 54046 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
| 57993 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
| 91179 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
| 128784 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
| 135084 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
| 135582 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
| 137614 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
| 148200 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
| 166165 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
| 221752 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
| 240018 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
| 251047 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
| 265380 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
| 272423 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
| 282895 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
duplicated_class_val = Counter(df[df.duplicated()]["is_fraud"])
duplicated_class_val
Counter({0: 1062, 1: 19})
print(f'Porcentaje de la clase 0 (no fraude) duplicados: {(duplicated_class_val[0]/class_val[0])*100 :.4f}%')
print(f'Porcentaje de la clase 1 (fraude) duplicados: {(duplicated_class_val[1]/class_val[1])*100 :.4f}%')
Porcentaje de la clase 0 (no fraude) duplicados: 0.3735% Porcentaje de la clase 1 (fraude) duplicados: 3.8618%
These rows are indeed duplicated across every column, possibly due to an error; the duplicates will be removed.
This does mean removing a larger proportion of class 1 (fraud), 3.8618%, than of class 0 (no fraud), 0.3735%, so the class imbalance will become more pronounced.
In any case, the goal is a model that generalizes and predicts new data correctly, not one that merely memorizes the training data.
# Drop duplicates
df = df.drop_duplicates().reset_index(drop=True)
# Check that there are no more duplicates; now there is only one result:
df[df["transaction_id"] == '10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d113819315d49a600425e'] #.loc[:, \ ["transaction_id", "timestamp", "amount", "is_fraud"]]
| transaction_id | timestamp | amount | variable_01 | variable_02 | variable_03 | variable_04 | variable_05 | variable_06 | variable_07 | variable_08 | variable_09 | variable_10 | variable_11 | variable_12 | variable_13 | variable_14 | variable_15 | variable_16 | variable_17 | variable_18 | variable_19 | variable_20 | variable_21 | variable_22 | variable_23 | variable_24 | variable_25 | variable_26 | variable_27 | variable_28 | variable_29 | variable_30 | variable_31 | variable_32 | is_fraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1668 | 10e5320e2a3c4eeaf6a04936e0ac114903e1b3030c0d11... | 163152.0 | 7.56 | -0.858197 | -0.035702 | 0.327804 | -0.133198 | -0.86979 | -0.35517 | 0.524395 | -0.370294 | 1.37579 | 0.234834 | -0.412833 | -0.228008 | -0.78521 | 1.693218 | -1.564352 | -0.256168 | -0.709405 | 1.385525 | 4.217934 | 0.122648 | -0.156001 | 0.585362 | 3.717077 | 1.511706 | 3.378471 | 2.883976 | 1.585949 | -1.196037 | 1.114535 | -1.177816 | -11.328203 | -0.114247 | 0 |
print("Número de duplicados: ", df.duplicated().sum())
Número de duplicados: 0
df_shape_2 = df.shape
print(df_shape_1)
print(df_shape_2)
(284807, 36) (283726, 36)
# # Unique values
# The output is too large to display, although no column has a meaningful set of unique values.
# for i in df.columns:
# print(f"Unique value in {i}:")
# print(df[i].unique(),'\n')
# plt.figure(figsize=(24, 18))
# heatmap = sns.heatmap(df.iloc[:, 1:].corr(), cmap='BrBG')
# heatmap.set_title('Mapa de Correlación', fontdict={'fontsize':18}, pad=12);
fig = px.imshow(df.iloc[:, 1:].corr().round(1), text_auto=True, aspect="auto", color_continuous_scale=px.colors.sequential.Blues)
fig.layout.height = 600
fig.layout.width = 1050
fig.update_coloraxes(showscale=False)
fig.update_layout(
title_text="Mapa de Correlación")
fig.show()
There are 4 perfect correlations between pairs of variables:
The variables from variable_29 onwards will be dropped.
Justification: a perfectly correlated variable adds no new information to the model, so one variable from each perfectly correlated pair can be removed.
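A quick, programmatic way to list those perfectly correlated pairs (a sketch; the 0.999 cutoff is an arbitrary tolerance):
# Keep only the upper triangle so each pair appears once, then filter |r| close to 1
corr_matrix = df.iloc[:, 1:].corr()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
perfect_pairs = upper.stack().loc[lambda s: s.abs() > 0.999]
print(perfect_pairs)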
df.drop(columns = ["variable_29", "variable_30", "variable_31", "variable_32"], inplace = True)
%%time
de.generate_data_description_table(df.loc[:, ~df.columns.isin(['is_fraud', 'transaction_id', 'timestamp', 'amount'])])
CPU times: total: 281 ms Wall time: 407 ms
| count | mean | std | min | 25% | 50% | 75% | max | missing | |
|---|---|---|---|---|---|---|---|---|---|
| variable_22 | 283726.0000 | 0.0018 | 1.2277 | -43.5572 | -0.5525 | 0.0409 | 0.5705 | 120.5895 | 0.0000 |
| variable_23 | 283726.0000 | -0.0011 | 1.3319 | -26.1605 | -0.7690 | -0.2752 | 0.3968 | 73.3016 | 0.0000 |
| variable_09 | 283726.0000 | 0.0002 | 0.7700 | -54.4977 | -0.2115 | -0.0624 | 0.1332 | 39.4209 | 0.0000 |
| variable_24 | 283726.0000 | 0.0018 | 1.3770 | -113.7433 | -0.6898 | -0.0535 | 0.6122 | 34.8017 | 0.0000 |
| variable_01 | 283726.0000 | 0.0005 | 0.3280 | -15.4301 | -0.0528 | 0.0113 | 0.0783 | 33.8478 | 0.0000 |
| variable_02 | 283726.0000 | 0.0018 | 0.3957 | -22.5657 | -0.0706 | 0.0015 | 0.0912 | 31.6122 | 0.0000 |
| variable_08 | 283726.0000 | -0.0004 | 0.7239 | -34.8304 | -0.2283 | -0.0294 | 0.1862 | 27.2028 | 0.0000 |
| variable_19 | 283726.0000 | -0.0014 | 1.0764 | -24.5883 | -0.5356 | -0.0932 | 0.4536 | 23.7451 | 0.0000 |
| variable_06 | 283726.0000 | 0.0002 | 0.6237 | -44.8077 | -0.1617 | -0.0112 | 0.1477 | 22.5284 | 0.0000 |
| variable_27 | 283726.0000 | -0.0041 | 1.6467 | -72.7157 | -0.6003 | 0.0639 | 0.8003 | 22.0577 | 0.0000 |
| variable_21 | 283726.0000 | -0.0009 | 1.1791 | -73.2167 | -0.2088 | 0.0219 | 0.3257 | 20.0072 | 0.0000 |
| variable_13 | 283726.0000 | 0.0012 | 0.8737 | -14.1299 | -0.4669 | 0.0671 | 0.5235 | 17.3151 | 0.0000 |
| variable_25 | 283726.0000 | -0.0030 | 1.4142 | -5.6832 | -0.8501 | -0.0222 | 0.7396 | 16.8753 | 0.0000 |
| variable_20 | 283726.0000 | -0.0016 | 1.0955 | -13.4341 | -0.6442 | -0.0526 | 0.5960 | 15.5950 | 0.0000 |
| variable_18 | 283726.0000 | 0.0002 | 1.0187 | -4.7975 | -0.7616 | -0.0323 | 0.7396 | 12.0189 | 0.0000 |
| variable_15 | 283726.0000 | 0.0003 | 0.9522 | -19.2143 | -0.4257 | 0.0502 | 0.4923 | 10.5268 | 0.0000 |
| variable_07 | 283726.0000 | -0.0000 | 0.7246 | -10.9331 | -0.5427 | 0.0067 | 0.5282 | 10.5031 | 0.0000 |
| variable_26 | 283726.0000 | 0.0016 | 1.5087 | -48.3256 | -0.8897 | 0.1800 | 1.0270 | 9.3826 | 0.0000 |
| variable_12 | 283726.0000 | 0.0002 | 0.8425 | -25.1628 | -0.4839 | -0.0659 | 0.3990 | 9.2535 | 0.0000 |
| variable_14 | 283726.0000 | 0.0010 | 0.9149 | -4.4989 | -0.5815 | 0.0493 | 0.6501 | 8.8777 | 0.0000 |
| variable_17 | 283726.0000 | -0.0007 | 0.9947 | -18.6837 | -0.4062 | 0.1391 | 0.6170 | 7.8484 | 0.0000 |
| variable_04 | 283726.0000 | -0.0002 | 0.5212 | -10.2954 | -0.3175 | 0.0163 | 0.3507 | 7.5196 | 0.0000 |
| variable_16 | 283726.0000 | 0.0006 | 0.9954 | -5.7919 | -0.6479 | -0.0129 | 0.6632 | 7.1269 | 0.0000 |
| variable_10 | 283726.0000 | -0.0003 | 0.8134 | -7.2135 | -0.4563 | 0.0034 | 0.4585 | 5.5920 | 0.0000 |
| variable_11 | 283726.0000 | 0.0015 | 0.8374 | -9.4987 | -0.4980 | -0.0021 | 0.5020 | 5.0411 | 0.0000 |
| variable_05 | 283726.0000 | 0.0002 | 0.6056 | -2.8366 | -0.3545 | 0.0410 | 0.4397 | 4.5845 | 0.0000 |
| variable_03 | 283726.0000 | 0.0001 | 0.4821 | -2.6046 | -0.3268 | -0.0522 | 0.2403 | 3.5173 | 0.0000 |
| variable_28 | 283726.0000 | 0.0059 | 1.9480 | -56.4075 | -0.9160 | 0.0204 | 1.3161 | 2.4549 | 0.0000 |
%%time
# Descriptive statistics for amount
de.generate_data_description_table(df.loc[:, ['amount']])
CPU times: total: 0 ns Wall time: 16.3 ms
| count | mean | std | min | 25% | 50% | 75% | max | missing | |
|---|---|---|---|---|---|---|---|---|---|
| amount | 283726.0000 | 88.4727 | 250.3994 | 0.0000 | 5.6000 | 22.0000 | 77.5100 | 25691.1600 | 0.0000 |
# # Box & Whisker Plot
# num_columns = len(df.columns)
# num_plots_per_row = 2
# num_rows = (num_columns + num_plots_per_row - 1) // num_plots_per_row
# fig, axes = plt.subplots(num_rows, num_plots_per_row, figsize=(15, 5*num_rows))
# for i, col in enumerate(df.iloc[:, 1:].columns):
# ax = axes[i // num_plots_per_row, i % num_plots_per_row]
# sns.boxplot(x=df[col], ax=ax)
# ax.set_title(col)
# plt.tight_layout()
# plt.savefig("imgs/box_plots.png")
# plt.show()
According to the plot, there are quite pronounced outliers. For time reasons I did not analyze them, but I would consider removing them and checking whether that has any effect on the model.
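A sketch of what that test could look like, using the usual 1.5 * IQR rule (not actually applied in this notebook; the example column is arbitrary):
def remove_iqr_outliers(data: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    # Keep only the rows inside [Q1 - k*IQR, Q3 + k*IQR] for the given column
    q1, q3 = data[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return data[data[column].between(q1 - k * iqr, q3 + k * iqr)]
# Hypothetical usage: df_trimmed = remove_iqr_outliers(df, 'variable_22')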
# num_columns = len(df.columns)
# num_plots_per_row = 3
# num_rows = (num_columns + num_plots_per_row - 1) // num_plots_per_row
# fig, axes = plt.subplots(num_rows, num_plots_per_row, figsize=(15, 5*num_rows))
# for i, col in enumerate(df.iloc[:, 1:].columns):
# ax = axes[i // num_plots_per_row, i % num_plots_per_row]
# sns.distplot(df[col], ax=ax)
# ax.set_title(col)
# plt.tight_layout()
# plt.savefig("imgs/distributions.png")
# plt.show()
The distributions look fairly well defined. If there were missing data, a distribution analysis could be done with the fitter package, which fits the data to the most similar distribution; the missing values could then be generated from that same distribution. This avoids biasing the data the way imputing with the mean or mode would. Another option is model-based imputation.
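A hedged sketch of how fitter could be used for that (the package is only mentioned here, not required, and both the column and the candidate distributions are arbitrary examples):
from fitter import Fitter
sample = df['variable_01'].dropna().to_numpy()
f = Fitter(sample, distributions=['norm', 'lognorm', 'gamma', 'expon'])
f.fit()
print(f.summary())                            # goodness-of-fit ranking
print(f.get_best(method='sumsquare_error'))   # best distribution and its parameters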
class_count = df['is_fraud'].value_counts(sort=True).sort_index()
class_count
is_fraud 0 283253 1 473 Name: count, dtype: int64[pyarrow]
%%time
# Class composition - pie chart
fig = go.Figure(data=[go.Pie(labels=['No Fraude', 'Fraude'],
                             values=class_count,
                             pull=[0.1, 0],  # one pull value per slice
                             opacity=0.85)])
fig.update_layout(
title_text="Composición de Clase")
fig.show()
CPU times: total: 0 ns Wall time: 8.11 ms
%%time
# Class composition - histogram
fig = px.histogram(df, x="is_fraud",
                   title='Class Composition',
                   opacity=0.85,
                   color = 'is_fraud', text_auto=True)
fig.show()
CPU times: total: 78.1 ms Wall time: 221 ms
print('No Fraude:', round(df['is_fraud'].value_counts()[0]/len(df) * 100,2), '% del dataset')
print('Fraude:', round(df['is_fraud'].value_counts()[1]/len(df) * 100,2), '% del dataset')
No Fraude: 99.83 % del dataset Fraude: 0.17 % del dataset
There is a clear class imbalance, and it can negatively affect everything that follows. Some ways to address it are: undersampling the majority class, oversampling the minority class (e.g., SMOTE), and class weights.
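# Cast the Arrow-backed columns to NumPy dtypes to avoid the modeling compatibility issues mentioned earlier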
for i in range(1, 29):
column_name = f"variable_{i:02d}"
df[column_name] = df[column_name].astype(float)
df['transaction_id'] = df['transaction_id'].astype(str)
df['timestamp'] = df['timestamp'].astype(float)
df['amount'] = df['amount'].astype(float)
df['is_fraud'] = df['is_fraud'].astype(int)
df.dtypes
transaction_id object timestamp float64 amount float64 variable_01 float64 variable_02 float64 variable_03 float64 variable_04 float64 variable_05 float64 variable_06 float64 variable_07 float64 variable_08 float64 variable_09 float64 variable_10 float64 variable_11 float64 variable_12 float64 variable_13 float64 variable_14 float64 variable_15 float64 variable_16 float64 variable_17 float64 variable_18 float64 variable_19 float64 variable_20 float64 variable_21 float64 variable_22 float64 variable_23 float64 variable_24 float64 variable_25 float64 variable_26 float64 variable_27 float64 variable_28 float64 is_fraud int32 dtype: object
# Most of the features appear to be scaled already; however, amount and timestamp are not.
rob_scaler = RobustScaler() # RobustScaler is less susceptible to outliers
df['scaled_amount'] = rob_scaler.fit_transform(df['amount'].values.reshape(-1,1))
df['scaled_timestamp'] = rob_scaler.fit_transform(df['timestamp'].values.reshape(-1,1))
df.drop(['amount','timestamp'], axis=1, inplace=True)
scaled_amount = df['scaled_amount']
scaled_time = df['scaled_timestamp']
df.drop(['scaled_amount', 'scaled_timestamp'], axis=1, inplace=True)
df.insert(1, 'scaled_amount', scaled_amount)
df.insert(2, 'scaled_timestamp', scaled_time)
Each option has advantages and disadvantages. Undersampling can improve processing time but implies losing information; oversampling increases the number of minority-class samples, but those samples are not 'organic' and the dataset grows. The class-weight approach has the advantage that its use is somewhat better founded, since the weights can be tuned (for example, with a grid search).
To move forward, I decided to use a combination of undersampling (RandomUnderSampler), oversampling (SMOTE), and class weights, so that the combination softens the negative impact each technique would have on its own.
# Oversampling and undersampling
under = RandomUnderSampler(sampling_strategy = 0.1) # undersample until the minority/majority ratio is 0.1
over = SMOTE(sampling_strategy = 0.5) # oversample until the minority/majority ratio is 0.5
# Store the features and the class as arrays
X = df.iloc[:, 1:-1].values
y = df.iloc[:, -1].values
# Build a Pipeline that runs the undersampling and oversampling steps
steps = [('under', under),('over', over)]
pipeline = Pipeline(steps=steps)
X, y = pipeline.fit_resample(X, y) # resample with both steps applied
Counter(y)
Counter({0: 4730, 1: 2365})
y = y.reshape((-1, 1))
df_resampled = np.concatenate((X, y), axis = 1)
df_resampled = pd.DataFrame(df_resampled, columns = df.iloc[:, 1:].columns)
# Split the data after applying oversampling and undersampling
x_train1, x_test1, y_train1, y_test1 = train_test_split(X, y, test_size = 0.25, random_state = 42)
With the reduced and balanced data we can get a better view of the correlations, histograms, boxplots, distributions, and other plots; I redid the correlation map as an example of how it changes.
To look at the other changes in much more detail, I made another notebook called dataprep_report_PPS, as a demo of the automated EDA library dataprep and of PPS, an alternative to correlation.
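For reference, a hedged sketch of the kind of calls involved, assuming the dataprep and ppscore packages are installed (PPS, the Predictive Power Score, is asymmetric and can capture non-linear relations, unlike Pearson correlation):
from dataprep.eda import create_report
import ppscore as pps
create_report(df).save('dataprep_report')                     # automated EDA report (HTML)
pps_to_target = pps.predictors(df.iloc[:, 1:], y='is_fraud')  # PPS of each feature with respect to the target
print(pps_to_target[['x', 'y', 'ppscore']].head(10))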
df_corr = df.iloc[:, 1:].corr()
fig = px.imshow(df_corr[["is_fraud", 'scaled_amount']], text_auto=True, aspect="auto", color_continuous_scale=px.colors.sequential.Blues)
fig.layout.height = 600
fig.layout.width = 1050
fig.update_coloraxes(showscale=False)
fig.update_layout(
title_text="Mapa de Correlación")
fig.show()
Logistic Regression models the probability of the positive class with the sigmoid function $S(x) = \frac{1}{1 + e^{-X\beta}}$, where $X$ is the set of predictor features and $\beta$ is the corresponding weight vector. Computing $S(x)$ produces a probability that indicates whether an observation should be classified as `1` or `0`.
It is highly interpretable thanks to its output (probabilities), which makes it easier to explain to a less technical audience than other models.
Computationally, it requires less power than more complex models such as Neural Networks or Decision Trees.
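A small self-contained check of that relationship on synthetic data, only to illustrate that predict_proba is the sigmoid of the linear score:
from sklearn.linear_model import LogisticRegression
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)
lr_demo = LogisticRegression().fit(X_demo, y_demo)
scores = lr_demo.decision_function(X_demo)  # X·beta + intercept
assert np.allclose(sigmoid(scores), lr_demo.predict_proba(X_demo)[:, 1])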
# Logistic Regression with undersampling, oversampling, and class weights
LR = LogisticRegression(random_state = 0, C=10, penalty= 'l2', class_weight = {0: 1, 1:3})
LR.fit(x_train1, y_train1) #.predict(X).sum()
LogisticRegression(C=10, class_weight={0: 1, 1: 3}, random_state=0)
# refit tells GridSearchCV which metric to optimize for
# Widening the search space for the weights helps the model focus on the fraud cases
def grids(search_space: np.ndarray, opt_metric: str, cv: int):
grid_ = GridSearchCV(
estimator = LogisticRegression(max_iter = 500),
param_grid = {'class_weight': [{0: 1, 1:v} for v in search_space]},
scoring = {'precision': make_scorer(precision_score), 'recall': make_scorer(recall_score), 'f1': make_scorer(f1_score)},
refit = opt_metric,
return_train_score = True,
cv = cv,
n_jobs = -1
)
return grid_
grid_prec = grids(np.linspace(1, 20, 30), opt_metric = 'precision', cv = 10)
grid_rec = grids(np.linspace(1, 20, 30), opt_metric = 'recall', cv = 10)
grid_f1 = grids(np.linspace(1, 20, 30), opt_metric = 'f1', cv = 10)
%%time
grid_prec.fit(x_train1, y_train1)
grid_prec.best_params_['class_weight']
CPU times: total: 1.06 s Wall time: 7.06 s
{0: 1, 1: 1.0}
%%time
# Note how a model focused on detecting more frauds (recall) assigns a much larger weight to class '1'!
grid_rec.fit(x_train1, y_train1)
grid_rec.best_params_['class_weight']
CPU times: total: 453 ms Wall time: 5.03 s
{0: 1, 1: 20.0}
%%time
grid_f1.fit(x_train1, y_train1)
grid_f1.best_params_['class_weight']
CPU times: total: 484 ms Wall time: 4.65 s
{0: 1, 1: 1.0}
# These are the GridSearchCV results
# By default it optimizes 'mean_test_score', and the default score is accuracy.
# With such a strong imbalance, accuracy is not a good reference metric.
# It is replaced by precision, recall, or f1, as appropriate; these scorers are added to GridSearchCV.
cv_results = pd.DataFrame(grid_prec.cv_results_)  # renamed so the dataset df is not overwritten
cv_results[['param_class_weight', 'params',
            'mean_test_precision', 'mean_train_precision',
            'mean_test_recall', 'mean_train_recall',
            'mean_test_f1', 'mean_train_f1']]
# These results come from the first fit, i.e., optimized for precision, which is why the weights 0:1 and 1:1 are chosen
# Note how precision decreases as the class-1 weight grows
| param_class_weight | params | mean_test_precision | mean_train_precision | mean_test_recall | mean_train_recall | mean_test_f1 | mean_train_f1 | |
|---|---|---|---|---|---|---|---|---|
| 0 | {0: 1, 1: 1.0} | {'class_weight': {0: 1, 1: 1.0}} | 0.973826 | 0.977107 | 0.900654 | 0.902701 | 0.935664 | 0.938431 |
| 1 | {0: 1, 1: 1.6551724137931034} | {'class_weight': {0: 1, 1: 1.6551724137931034}} | 0.953357 | 0.958330 | 0.911750 | 0.913121 | 0.931950 | 0.935178 |
| 2 | {0: 1, 1: 2.310344827586207} | {'class_weight': {0: 1, 1: 2.310344827586207}} | 0.937694 | 0.942063 | 0.919521 | 0.921322 | 0.928354 | 0.931575 |
| 3 | {0: 1, 1: 2.9655172413793105} | {'class_weight': {0: 1, 1: 2.9655172413793105}} | 0.925929 | 0.930141 | 0.928398 | 0.930941 | 0.926932 | 0.930538 |
| 4 | {0: 1, 1: 3.6206896551724137} | {'class_weight': {0: 1, 1: 3.6206896551724137}} | 0.911364 | 0.917799 | 0.935614 | 0.939696 | 0.923126 | 0.928617 |
| 5 | {0: 1, 1: 4.275862068965517} | {'class_weight': {0: 1, 1: 4.275862068965517}} | 0.902508 | 0.906912 | 0.943944 | 0.947897 | 0.922559 | 0.926951 |
| 6 | {0: 1, 1: 4.931034482758621} | {'class_weight': {0: 1, 1: 4.931034482758621}} | 0.892011 | 0.896963 | 0.947833 | 0.954310 | 0.918894 | 0.924747 |
| 7 | {0: 1, 1: 5.586206896551724} | {'class_weight': {0: 1, 1: 5.586206896551724}} | 0.883226 | 0.886665 | 0.953944 | 0.958441 | 0.916965 | 0.921155 |
| 8 | {0: 1, 1: 6.241379310344827} | {'class_weight': {0: 1, 1: 6.241379310344827}} | 0.872509 | 0.875835 | 0.958386 | 0.962388 | 0.913167 | 0.917069 |
| 9 | {0: 1, 1: 6.896551724137931} | {'class_weight': {0: 1, 1: 6.896551724137931}} | 0.860625 | 0.866037 | 0.960608 | 0.964854 | 0.907620 | 0.912773 |
| 10 | {0: 1, 1: 7.551724137931034} | {'class_weight': {0: 1, 1: 7.551724137931034}} | 0.850422 | 0.856123 | 0.962268 | 0.967012 | 0.902631 | 0.908191 |
| 11 | {0: 1, 1: 8.206896551724139} | {'class_weight': {0: 1, 1: 8.206896551724139}} | 0.840618 | 0.846469 | 0.964487 | 0.969725 | 0.898116 | 0.903909 |
| 12 | {0: 1, 1: 8.862068965517242} | {'class_weight': {0: 1, 1: 8.862068965517242}} | 0.832167 | 0.838451 | 0.967265 | 0.972068 | 0.894438 | 0.900324 |
| 13 | {0: 1, 1: 9.517241379310345} | {'class_weight': {0: 1, 1: 9.517241379310345}} | 0.824623 | 0.830959 | 0.968929 | 0.974596 | 0.890748 | 0.897057 |
| 14 | {0: 1, 1: 10.172413793103448} | {'class_weight': {0: 1, 1: 10.172413793103448}} | 0.819691 | 0.823909 | 0.969484 | 0.977248 | 0.888095 | 0.894043 |
| 15 | {0: 1, 1: 10.827586206896552} | {'class_weight': {0: 1, 1: 10.827586206896552}} | 0.814467 | 0.816628 | 0.970595 | 0.979282 | 0.885504 | 0.890581 |
| 16 | {0: 1, 1: 11.482758620689655} | {'class_weight': {0: 1, 1: 11.482758620689655}} | 0.809793 | 0.810322 | 0.973929 | 0.980762 | 0.884101 | 0.887424 |
| 17 | {0: 1, 1: 12.137931034482758} | {'class_weight': {0: 1, 1: 12.137931034482758}} | 0.802691 | 0.805065 | 0.975592 | 0.981749 | 0.880529 | 0.884664 |
| 18 | {0: 1, 1: 12.793103448275861} | {'class_weight': {0: 1, 1: 12.793103448275861}} | 0.796337 | 0.798801 | 0.977256 | 0.982674 | 0.877400 | 0.881239 |
| 19 | {0: 1, 1: 13.448275862068964} | {'class_weight': {0: 1, 1: 13.448275862068964}} | 0.790341 | 0.793131 | 0.980031 | 0.983475 | 0.874869 | 0.878098 |
| 20 | {0: 1, 1: 14.103448275862068} | {'class_weight': {0: 1, 1: 14.103448275862068}} | 0.783820 | 0.787720 | 0.980586 | 0.984092 | 0.871064 | 0.875013 |
| 21 | {0: 1, 1: 14.758620689655173} | {'class_weight': {0: 1, 1: 14.758620689655173}} | 0.779631 | 0.782661 | 0.980586 | 0.984523 | 0.868482 | 0.872052 |
| 22 | {0: 1, 1: 15.413793103448276} | {'class_weight': {0: 1, 1: 15.413793103448276}} | 0.773983 | 0.776934 | 0.981697 | 0.985140 | 0.865417 | 0.868728 |
| 23 | {0: 1, 1: 16.06896551724138} | {'class_weight': {0: 1, 1: 16.06896551724138}} | 0.769450 | 0.772334 | 0.982808 | 0.985818 | 0.863008 | 0.866105 |
| 24 | {0: 1, 1: 16.724137931034484} | {'class_weight': {0: 1, 1: 16.724137931034484}} | 0.762589 | 0.767229 | 0.983361 | 0.986250 | 0.858895 | 0.863049 |
| 25 | {0: 1, 1: 17.379310344827587} | {'class_weight': {0: 1, 1: 17.379310344827587}} | 0.757178 | 0.763310 | 0.983361 | 0.986805 | 0.855420 | 0.860773 |
| 26 | {0: 1, 1: 18.03448275862069} | {'class_weight': {0: 1, 1: 18.03448275862069}} | 0.750570 | 0.759323 | 0.984469 | 0.987360 | 0.851623 | 0.858441 |
| 27 | {0: 1, 1: 18.689655172413794} | {'class_weight': {0: 1, 1: 18.689655172413794}} | 0.746496 | 0.755092 | 0.984469 | 0.987668 | 0.848986 | 0.855845 |
| 28 | {0: 1, 1: 19.344827586206897} | {'class_weight': {0: 1, 1: 19.344827586206897}} | 0.743702 | 0.751497 | 0.984469 | 0.988099 | 0.847171 | 0.853690 |
| 29 | {0: 1, 1: 20.0} | {'class_weight': {0: 1, 1: 20.0}} | 0.739341 | 0.747902 | 0.986136 | 0.988593 | 0.844959 | 0.851550 |
plt.figure(figsize = (12, 4))
for score in ['mean_test_recall', 'mean_test_precision', 'mean_test_f1']:
    plt.plot([w[1] for w in cv_results['param_class_weight']],  # class-1 weight on the x axis
             cv_results[score],
             label = score)
plt.legend();
# Unlike .predict, .predict_proba returns the probabilities without applying the threshold criterion
probs = grid_f1.predict_proba(x_test1)
# The array holds the probabilities for class 0 and class 1, which are complementary.
# We will work with the class-1 probabilities.
print(probs.shape)
probs_1 = probs[:, 1]
(1774, 2)
fpr, tpr, thresholds = roc_curve(y_test1, probs_1)
# Plot the ROC curve
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'k--') # Plot the random guess line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.show()
def print_score(label, prediction, train=True):
    # Same report for train and test; only the header changes
    clf_report = pd.DataFrame(classification_report(label, prediction, output_dict=True))
    header = "Train Result" if train else "Test Result"
    print(f"{header}:\n==========================================================================")
    print("__________________________________________________________________________")
    print(f"Classification Report:\n{clf_report}")
    print("__________________________________________________________________________")
    print(f"Confusion Matrix: \n {confusion_matrix(label, prediction)}\n")
# Function to display the confusion matrix
def CM(y_test, y_pred):
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Custom labels for the plot
names = ['True Neg','False Pos','False Neg','True Pos']
counts = [value for value in cm.flatten()]
percentages = ['{0:.2%}'.format(value) for value in cm.flatten()/np.sum(cm)]
labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(names, counts, percentages)]
labels = np.asarray(labels).reshape(2, 2)
sns.heatmap(cm, annot = labels, cmap = 'Blues', fmt ='')
prec_train_pred = grid_prec.predict(x_train1)
prec_test_pred = grid_prec.predict(x_test1)
print_score(y_train1, prec_train_pred, train=True)
print_score(y_test1, prec_test_pred, train=False)
Train Result:
==========================================================================
__________________________________________________________________________
Classification Report:
0 1 accuracy macro avg weighted avg
precision 0.952120 0.976591 0.959782 0.964356 0.960407
recall 0.988917 0.902886 0.959782 0.945901 0.959782
f1-score 0.970170 0.938293 0.959782 0.954232 0.959375
support 3519.000000 1802.000000 0.959782 5321.000000 5321.000000
__________________________________________________________________________
Confusion Matrix:
[[3480 39]
[ 175 1627]]
Test Result:
==========================================================================
__________________________________________________________________________
Classification Report:
0 1 accuracy macro avg weighted avg
precision 0.949126 0.967054 0.95434 0.958090 0.954815
recall 0.985962 0.886323 0.95434 0.936143 0.954340
f1-score 0.967193 0.924930 0.95434 0.946062 0.953781
support 1211.000000 563.000000 0.95434 1774.000000 1774.000000
__________________________________________________________________________
Confusion Matrix:
[[1194 17]
[ 64 499]]
CM(y_test1, prec_test_pred)
preds_prec = grid_prec.predict(x_test1)
roc_auc_score(y_test1, preds_prec)
0.9361426415348941
rec_train_pred = grid_rec.predict(x_train1)
rec_test_pred = grid_rec.predict(x_test1)
print_score(y_train1, rec_train_pred, train=True)
print_score(y_test1, rec_test_pred, train=False)
Train Result:
==========================================================================
__________________________________________________________________________
Classification Report:
0 1 accuracy macro avg weighted avg
precision 0.993174 0.745295 0.881789 0.869234 0.909228
recall 0.826939 0.988901 0.881789 0.907920 0.881789
f1-score 0.902465 0.849988 0.881789 0.876227 0.884694
support 3519.000000 1802.000000 0.881789 5321.000000 5321.000000
__________________________________________________________________________
Confusion Matrix:
[[2910 609]
[ 20 1782]]
Test Result:
==========================================================================
__________________________________________________________________________
Classification Report:
0 1 accuracy macro avg weighted avg
precision 0.994000 0.719638 0.874295 0.856819 0.906928
recall 0.820809 0.989343 0.874295 0.905076 0.874295
f1-score 0.899141 0.833209 0.874295 0.866175 0.878216
support 1211.000000 563.000000 0.874295 1774.000000 1774.000000
__________________________________________________________________________
Confusion Matrix:
[[994 217]
[ 6 557]]
CM(y_test1, rec_test_pred)
roc_auc_score(y_test1, rec_test_pred)
0.9050760274746147
f1_train_pred = grid_f1.predict(x_train1)
f1_test_pred = grid_f1.predict(x_test1)
print_score(y_train1, f1_train_pred, train=True)
print_score(y_test1, f1_test_pred, train=False)
Train Result:
==========================================================================
__________________________________________________________________________
Classification Report:
0 1 accuracy macro avg weighted avg
precision 0.952120 0.976591 0.959782 0.964356 0.960407
recall 0.988917 0.902886 0.959782 0.945901 0.959782
f1-score 0.970170 0.938293 0.959782 0.954232 0.959375
support 3519.000000 1802.000000 0.959782 5321.000000 5321.000000
__________________________________________________________________________
Confusion Matrix:
[[3480 39]
[ 175 1627]]
Test Result:
==========================================================================
__________________________________________________________________________
Classification Report:
0 1 accuracy macro avg weighted avg
precision 0.949126 0.967054 0.95434 0.958090 0.954815
recall 0.985962 0.886323 0.95434 0.936143 0.954340
f1-score 0.967193 0.924930 0.95434 0.946062 0.953781
support 1211.000000 563.000000 0.95434 1774.000000 1774.000000
__________________________________________________________________________
Confusion Matrix:
[[1194 17]
[ 64 499]]
CM(y_test1, f1_test_pred)
roc_auc_score(y_test1, f1_test_pred)
0.9361426415348941
True Positives (TP): the number of positive instances correctly classified as positive.
False Positives (FP): type 1 error, the number of negative instances incorrectly classified as positive.
True Negatives (TN): the number of negative instances correctly classified as negative.
False Negatives (FN): type 2 error, the number of positive instances incorrectly classified as negative.
Reflection on FN and FP
In this context, a FN means that a credit/loan/transaction went through when the user actually defaults, so the loss would be the full principal.
With a FP, a credit/loan/transaction was not granted when it should have been. As an example, suppose a 10% interest rate on the loan: each false positive means the business misses the chance to earn 10% of the loan principal. That gives roughly a 10:1 ratio between the cost of a FN and a FP. This ratio can be used to evaluate the classifier by trying several thresholds and weighting Precision and Recall accordingly, as in the sketch below.
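A sketch of that idea, reusing the class-1 probabilities computed above; the 10:1 cost figures are the illustrative assumption from the previous paragraph, not calibrated values:
# Total expected cost per threshold: each FN costs 10 units, each FP costs 1
COST_FN, COST_FP = 10, 1
thresholds = np.linspace(0.05, 0.95, 19)
costs = []
for t in thresholds:
    preds_t = (probs_1 >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test1.ravel(), preds_t).ravel()
    costs.append(COST_FN * fn + COST_FP * fp)
best_t = thresholds[int(np.argmin(costs))]
print(f'Least costly threshold under these assumptions: {best_t:.2f}')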
Precision: $\frac{TP}{TP+FP}$, the proportion of instances predicted as positive that are truly positive.
Recall: $\frac{TP}{TP+FN}$, the proportion of truly positive instances that are detected.
f1: $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$, the harmonic mean of Precision and Recall.
❗️What do 'Precision' and 'Recall' mean in this context?
❗️Which metric should be given more importance?
A high Recall means detecting more true fraud cases; optimizing for it increases TP and reduces FN, but increases FP, that is, more incorrect fraud accusations, which can hurt customer satisfaction and the institution's reputation, while at the same time avoiding losses thanks to the frauds detected.
A high Precision means more certainty that a flagged case is truly fraudulent. This implies fewer incorrect accusations, but it can also mean more losses from cases that went undetected.
The final decision is up to the institution; I decided to weight both metrics equally, so I will continue with the f1-optimized results for the decile analysis.
There is an important detail about Logistic Regression and its predictions: with a threshold of 0.5, an output of 0.49 is labeled as no-fraud and one of 0.51 as fraud, an absurdly small difference between the two values. Decile analysis lets us explore these cases in more detail and gives an extra reference for decision making.
I chose the f1-optimized model for this part.
y_test1 = y_test1.reshape(y_test1.shape[0],)
# I wrote this function in Professor Paola's class :)
def decile_analysis(predictions: "Array", labels: "Array"):
# This function takes the model's predictions and the corresponding labels as input.
test_pred = predictions
test_results = pd.DataFrame(
data={
"predictions": test_pred,
"label": labels
}
)
# A DataFrame called test_results is created, containing the predictions and the labels.
test_results = test_results.sort_values(by=['predictions'], ascending=False)
test_results.reset_index(inplace=True)
# test_results is sorted in descending order by prediction and the index is reset.
deciles = np.array_split(np.array(test_results["predictions"]), 10)
dec_labels = np.array_split(np.array(test_results["label"]), 10)
# The predictions and labels are split into 10 deciles using np.array_split.
decile_results = pd.DataFrame(
data={
"Decil": np.arange(1, 11),
"Batch": [len(deciles[i]) for i in range(len(deciles))],
"Cumulative Batch": np.cumsum([len(deciles[i]) for i in range(len(deciles))]),
"Cumulative % Batch": np.round(np.cumsum([len(deciles[i]) for i in range(len(deciles))]) / len(test_pred),
4),
"True label": [sum(dec_labels[i]) for i in range(len(dec_labels))],
"True label %": np.round(
[sum(dec_labels[i]) / sum(test_results["label"]) for i in range(len(dec_labels))], 4),
"Cumulative label %": np.round(
np.cumsum([sum(dec_labels[i]) / sum(test_results["label"]) for i in range(len(dec_labels))]), 4),
"Probability Range": [str(np.round(deciles[i].max(), 4)) + " - " + str(np.round(deciles[i].min(), 4)) for
i in range(len(deciles))]
}
)
# A new DataFrame called decile_results is created with the results of the decile analysis.
# The values are computed per decile, including the number of observations in each decile ("Batch"),
# the cumulative number of observations up to the current decile ("Cumulative Batch"),
# the cumulative percentage of observations up to the current decile ("Cumulative % Batch"),
# the number of true labels in each decile ("True label"),
# the percentage of true labels relative to the total number of labels ("True label %"),
# the cumulative percentage of true labels up to the current decile ("Cumulative label %"),
# and the probability range of each decile ("Probability Range").
decile_results.set_index('Decil', inplace=True)
# The index of the decile_results DataFrame is set to "Decil".
return decile_results
# The decile_results DataFrame is returned as the result of the function.
decile_analysis = decile_analysis(probs_1, y_test1)
decile_analysis
| Batch | Cumulative Batch | Cumulative % Batch | True label | True label % | Cumulative label % | Probability Range | |
|---|---|---|---|---|---|---|---|
| Decil | |||||||
| 1 | 178 | 178 | 0.1003 | 178 | 0.3162 | 0.3162 | 1.0 - 1.0 |
| 2 | 178 | 356 | 0.2007 | 177 | 0.3144 | 0.6306 | 1.0 - 1.0 |
| 3 | 178 | 534 | 0.3010 | 150 | 0.2664 | 0.8970 | 1.0 - 0.361 |
| 4 | 178 | 712 | 0.4014 | 41 | 0.0728 | 0.9698 | 0.3563 - 0.0881 |
| 5 | 177 | 889 | 0.5011 | 10 | 0.0178 | 0.9876 | 0.0878 - 0.0449 |
| 6 | 177 | 1066 | 0.6009 | 3 | 0.0053 | 0.9929 | 0.0446 - 0.0261 |
| 7 | 177 | 1243 | 0.7007 | 2 | 0.0036 | 0.9964 | 0.0261 - 0.0159 |
| 8 | 177 | 1420 | 0.8005 | 2 | 0.0036 | 1.0000 | 0.0159 - 0.0085 |
| 9 | 177 | 1597 | 0.9002 | 0 | 0.0000 | 1.0000 | 0.0084 - 0.0026 |
| 10 | 177 | 1774 | 1.0000 | 0 | 0.0000 | 1.0000 | 0.0026 - 0.0 |
def decile_analysis_plot(y):
plt.figure(figsize=(7, 4))
plt.bar(np.arange(1, 11,1), y)
plt.xticks(np.arange(1, 11, 1))
plt.xlabel('Decile')
plt.ylabel('True Fraud Rate')
plt.title('Fraud % Rate by Decile');
decile_analysis_plot(decile_analysis["True label %"])
I am starting to learn this model in more depth with the book Effective XGBoost, by Matt Harrison.
Although I do not understand it well enough yet, its 'beginner-level' implementation has given good results, so I would consider it over Logistic Regression. For time reasons, I will leave it at that.
from xgboost import XGBClassifier
import xgboost as xgb
xgb_clf = XGBClassifier()
xgb_clf.fit(x_train1, y_train1, eval_metric='aucpr')
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)
y_train_pred = xgb_clf.predict(x_train1)
y_test_pred = xgb_clf.predict(x_test1)
probs_xgb = xgb_clf.predict_proba(x_test1)
print_score(y_train1, y_train_pred, train=True)
print_score(y_test1, y_test_pred, train=False)
Train Result:
==========================================================================
__________________________________________________________________________
Classification Report:
0 1 accuracy macro avg weighted avg
precision 1.0 1.0 1.0 1.0 1.0
recall 1.0 1.0 1.0 1.0 1.0
f1-score 1.0 1.0 1.0 1.0 1.0
support 3519.0 1802.0 1.0 5321.0 5321.0
__________________________________________________________________________
Confusion Matrix:
[[3519 0]
[ 0 1802]]
Test Result:
==========================================================================
__________________________________________________________________________
Classification Report:
0 1 accuracy macro avg weighted avg
precision 0.987675 0.983842 0.986471 0.985758 0.986458
recall 0.992568 0.973357 0.986471 0.982963 0.986471
f1-score 0.990115 0.978571 0.986471 0.984343 0.986452
support 1211.000000 563.000000 0.986471 1774.000000 1774.000000
__________________________________________________________________________
Confusion Matrix:
[[1202 9]
[ 15 548]]
CM(y_test1, y_test_pred)
# # The decile_analysis function has to be defined and run again (its name was overwritten by the result above)
# decile_analysis = decile_analysis(probs_xgb[:, 1], y_test1)
# decile_analysis
# decile_analysis_plot(decile_analysis["True label %"])
In a real-world context, every step of the Data Science process is crucial, because many important things are at stake:
The models' results have a direct impact on this: a single 1 as output can help prevent a fraud, but it can also mean an incorrect accusation. It is crucial to have models that reduce these errors and achieve maximum effectiveness. I believe there are many ways to save the computational and economic resources associated with a model that do not compromise its effectiveness; having an effective model matters, because there are gains or losses tied directly to it, although a balance between cost, processing, and effectiveness should of course still be sought. This is an iterative process in which the results of past models feed future models, that is, an investment.
Things I would improve about the project:
end_time = time.time()
total_time = end_time - start_time
print(f"Total runtime: {total_time:.2f} seconds")
Total runtime: 31.86 seconds